In this lab, we will first examine the various method for constructing a spatial weight matrix, then go through the process of analyzing and modeling datasets of areal data. Before starting the lab, you will need to set up a new folder for your working directory. Go to your geog6000 folder now and create a new folder for today’s class called lab12. The following files will be used in this lab, all available on Canvas:

  • New York leukemia data set (NY_data.zip)
  • Housing prices in Boston (boston.tr.zip - note this is a different file to one we have previously used, this contains the tract boundaries)
  • Crime rates in Columbus, Ohio (columbus.zip).
  • Prices and tax rates on used cars (usedcars.zip)

download these files from Canvas, and move them from your Downloads folder to the datafiles folder that you made previously. Make sure to unzip the zip files so that R can access the content. Note that on Windows, you will need to right-click on the file and select ‘Extract files’.

Now start RStudio and change the working directory to lab12. As a reminder, you can do this by going to the [Session] menu in RStudio, then [Change working directory]. This will open a file browser that you can use to browse through your computer and find the folder.

You will also need several packages to carry out all the steps listed here: sf, spdep and spatialreg. Make sure these are installed (if you haven’t already) before starting. I’m going to use tmap to map out some of the results, but you are welcome to use ggplot2 or the basic plot() command.

pkgs <- c("sf",
          "spdep",
          "spatialreg",
          "tmap")

install.packages(pkgs)
library(RColorBrewer)
library(sf)
library(spdep)
library(spatialreg)
library(tmap)

With all the examples given, it is important to not just type in the code, but to try changing the parameters and re-running the functions multiple times to get an idea of their effect. Help on the parameters can be obtained by typing help(functionname) or ?functionname.

Reading the data

We’ll start by using sf to read in and plot the geometries of the Boston housing dataset and the New York dataset using sf.

boston <- st_read("../datafiles/boston.tr/boston.shp", quiet = TRUE)
plot(st_geometry(boston))

NY8 <- st_read("../datafiles/NY_data/NY8_utm18.shp", quiet = TRUE)
plot(st_geometry(NY8))

As a reminder, you can see the attribute table for the polygons by simply typing the name of the sf object, or by using the $ notation to extract individual variables:

head(NY8)
## Simple feature collection with 6 features and 17 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 421423.4 ymin: 4661351 xmax: 427091.6 ymax: 4664822
## Projected CRS: WGS 84 / UTM zone 18N
##          AREANAME     AREAKEY        X        Y POP8 TRACTCAS  PROPCAS
## 1 Binghamton city 36007000100 4.069397 -67.3533 3540     3.08 0.000870
## 2 Binghamton city 36007000200 4.639371 -66.8619 3560     4.08 0.001146
## 3 Binghamton city 36007000300 5.709063 -66.9775 3739     1.09 0.000292
## 4 Binghamton city 36007000400 7.613831 -65.9958 2784     1.07 0.000384
## 5 Binghamton city 36007000500 7.315968 -67.3183 2571     3.06 0.001190
## 6 Binghamton city 36007000600 8.558753 -66.9344 2729     1.06 0.000388
##   PCTOWNHOME PCTAGE65P        Z  AVGIDIST PEXPOSURE   Cases       Xm       Ym
## 1  0.3277311 0.1466102  0.14197 0.2373852  3.167099 3.08284 4069.397 -67353.3
## 2  0.4268293 0.2351124  0.35555 0.2087413  3.038511 4.08331 4639.371 -66861.9
## 3  0.3377396 0.1380048 -0.58165 0.1708548  2.838229 1.08750 5709.063 -66977.5
## 4  0.4616048 0.1188937 -0.29634 0.1406045  2.643366 1.06515 7613.831 -65995.8
## 5  0.1924370 0.1415791  0.45689 0.1577753  2.758587 3.06017 7315.968 -67318.3
## 6  0.3651786 0.1410773 -0.28123 0.1726033  2.848411 1.06386 8558.753 -66934.4
##     Xshift  Yshift                       geometry
## 1 423391.0 4661502 POLYGON ((421840.4 4662874,...
## 2 423961.0 4661993 POLYGON ((421826.4 4663108,...
## 3 425030.6 4661878 POLYGON ((423159 4664759, 4...
## 4 426935.4 4662859 POLYGON ((425261.8 4663493,...
## 5 426637.5 4661537 POLYGON ((425033.3 4662995,...
## 6 427880.3 4661921 POLYGON ((426178.8 4664216,...
hist(NY8$TRACTCAS, 
     xlab = "Number of cases/tract")

We can also make thematic maps by specifying the column we wish to plot:

my.pal <- brewer.pal(9, "YlOrRd") 

plot(NY8["Cases"], 
     main = "New York Leukemia Cases",
     col = my.pal)

Building the spatial weight matrix

In order to test the different methods shown in class, we will extract the polygons for the Syracuse area from the NY8 dataset. We then get their geometry using st_geometry(), and their centroids with the st_centroid() function. We’ll use both of these to visualize the neighborhood structure.

Syracuse <- NY8[NY8$AREANAME == "Syracuse city", ]
Syracuse.geom <- st_geometry(Syracuse)
Syracuse.coords <- st_centroid(Syracuse.geom)
plot(Syracuse.geom, reset = FALSE)
plot(Syracuse.coords, pch = 16, col = 2, add = TRUE)

Neighborhood functions

The following section lists various methods for producing neighborhood structures in R. Functions to plot these are only shown for the first - you are encouraged to try plotting several of these to understand which areas are considered as neighbors.

Boundary methods

Queen’s case

Here two regions are considered neighbors if their boundaries or corners touch, and can be found using the poly2nb() function:

Sy1_nb <- poly2nb(Syracuse)

To visualize the resulting structure, plot the geometry of the original polygons, then add the structure - this requires the neighborhood, the centroids, and various color and line width options:

plot(Syracuse.geom, reset = FALSE)
plot(Sy1_nb, Syracuse.coords, add = TRUE, col = 2, lwd = 1.5)

Use this code to visualize the following structures as well.

Rooks’s case

Here two regions are considered neighbors only if a contiguous part of their boundaries touch.

Sy2_nb <- poly2nb(Syracuse, queen = FALSE)

Centroid methods

Delaunay triangulation

Here a triangulation is built between sets of three points. A set of three points is joined as long as no other points are found in a circle fit to the three points:

Sy3_nb <- tri2nb(Syracuse.coords)

Sphere of influence

The sphere-of-influence method restricts the links formed by Delauney triangulation to a certain length.

Sy4_nb <- graph2nb(soi.graph(Sy3_nb, Syracuse.coords))

\(k\) nearest neighbors

Here each region is linked to its \(k\) nearest neighbors irrespective of the distance between them

Sy5_nb <- knn2nb(knearneigh(Syracuse.coords, k = 1))
Sy6_nb <- knn2nb(knearneigh(Syracuse.coords, k = 2))

Distance functions

Distance functions link together two regions if their centroids are within a certain distance of each other. This requires two parameters: d1, the minimum distance, and d2, the maximum distance. Here we will set the maximum to 75% of the largest distance obtained when using 2 nearest neighbors. We obtain all the distances between nearest neighbor pairs in the first line of code (this extracts distances as a list with nbdists() and converts them into a vector with unlist()). Then we find the maximum of these distances. Finally we set d2 to this value \(*0.75\).

dists <- nbdists(Sy6_nb, Syracuse.coords)
dists <- unlist(dists) #list to vector

max_1nn <- max(dists)

Sy7_nb = dnearneigh(Syracuse.coords, d1 = 0, d2 = 0.75 * max_1nn)

Editing neighborhood structures

While these function provide a quick way to establish a neighborhood structure, they will likely include some connections that are unrealistic or exclude some real connections. It is possible to edit the neighborhood structure directly - this is simply a list of \(n\) vectors, where each vector contains the neighbors linked to a single polygon. To see the contents, simply type:

str(Sy1_nb)

And to access the first polygons neighbors:

Sy1_nb[[1]]
## [1]  2  5 11 21 22

While editing by hand is possible, it is not easy. A simpler way is to use the edit.nb() function that allows interactive editing of a neighborhood structure. Instructions for using this are given in the appendix below.

Spatial weights

The calculation of spatial weights from the neighborhood function can be done in several ways. In the following code, the first method does no standardization (binary weights) and the second does row standardization, so that for any given area, each neighbor is downweighted by the total number of neighbors. The function we use is nb2listw() which returns the sparse matrix giving the links between neighbors and the weight of this link. Non-neighbors get a weight of zero.

Sy1_lw_B <- nb2listw(Sy1_nb, style = 'B')
Sy1_lw_W <- nb2listw(Sy1_nb, style = 'W')

A more sophisticated method would be to use some knowledge of the dataset. Weights could be assigned by an inverse distance method, where closer neighbors have higher weights. To do this, we need the list of distances along each join for the neighborhood structure that we will use. In this case, we use the first one generated, the Queen’s case. First, we extract the distances from the neighborhood list (nbdists()), to obtain a list of distances. We then generate inverse distances by dividing by 1000 (to make these a little more manageable) and taking the reciprocal. This is a little complex as we obtain a list output from nbdists(). We create a new function, invd(), which calculates the inverse distance in kilometers (the coordinates are in meters). Then we apply this function to each item in the list using lapply(). This is the list version of the function apply() that we have used earlier.

dists <- nbdists(Sy1_nb, Syracuse.coords)

inverse_distance <- function(x) {1/(x/1000)}

idw <- lapply(dists, inverse_distance)

Sy1_lw_idwB <- nb2listw(Sy1_nb, glist = idw, style = "B")

Checking for autocorrrelation

From here on, we will concentrate on analyzing the Boston housing price data set. The median house price values that we wish to model have a skewed distribution (look at a histogram to see this), so we will create a new variable, containing log house prices:

boston$logCMEDV <- log(boston$CMEDV)
hist(boston$logCMEDV)

plot(boston["logCMEDV"])

We’ll also plot the map of this variable using tmap:

tm_shape(boston) + 
  tm_fill("logCMEDV") +
  tm_borders()

We also make up a neighborhood structure, using the Queen’s case contiguity method

boston.nb <- poly2nb(boston, queen = TRUE)
coords <- st_centroid(st_geometry(boston))

plot(st_geometry(boston), reset = FALSE)
plot(boston.nb, coords, add = TRUE, col = "gray")

And finally convert this to a spatial weight matrix

boston.listw = nb2listw(boston.nb)

Global Moran’s \(I\)

We will start by looking for global autocorrelation in the housing price data. Use the moran.test() function for this. Note we set the randomization assumption to be true as we know nothing about the larger spatial trends in this dataset. The function requires the variable and a spatial weight matrix. Look for the value of Moran’s \(I\), and the \(z\)-score (standard deviate) and associated \(p\)-value:

moran.test(boston$logCMEDV, 
           listw = boston.listw, 
           alternative = "two.sided", 
           randomisation = TRUE)
## 
##  Moran I test under randomisation
## 
## data:  boston$logCMEDV  
## weights: boston.listw    
## 
## Moran I statistic standard deviate = 27.06, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.7270399706     -0.0019801980      0.0007258269

Purely for the sake of comparison, we’ll also calculate Moran’s \(I\) under the normality assumption, stating that the pattern of house prices in Boston is a subset of a larger spatial trend. You should see a slight change in the variance calculated, but not enough to affect our conclusion that the data are strongly autocorrelated.

moran.test(boston$logCMEDV, 
           listw = boston.listw, 
           alternative = "two.sided", 
           randomisation = FALSE)
## 
##  Moran I test under normality
## 
## data:  boston$logCMEDV  
## weights: boston.listw    
## 
## Moran I statistic standard deviate = 27.038, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.7270399706     -0.0019801980      0.0007270156

We can also calculate the significance of the observed autocorrelation using a Monte Carlo method. Here, we randomly redistribute the values across the locations several hundred times, recalculating Moran’s \(I\) each time. Once done, we look at the rank of the observed version of Moran’s \(I\) against those obtained from random resampling. If we are either at the high or low end of all these random realizations, it is highly likely that the observed distribution is significantly autocorrelated. As we are using a rank, we cannot use a two-sided test, but must specify if we believe the autocorrelation to be positive (‘greater’) or negative (‘less’). The number of simulations to run is given by the parameter nsim. Increasing this, will increase the precision of the \(p\)-value obtained (but take longer to run):

moran.mc(boston$logCMEDV, 
         listw = boston.listw, 
         nsim = 999, 
         alternative = 'greater')
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  boston$logCMEDV 
## weights: boston.listw  
## number of simulations + 1: 1000 
## 
## statistic = 0.72704, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater

Now make the Moran scatterplot. This shows the relationship between the value of any given area and its neighbors. The slope of the fitted line is the value of Moran’s \(I\):

moran.plot(boston$logCMEDV, 
           boston.listw, 
           labels = as.character(boston$ID), 
           xlab = "Log Median Value", 
           ylab = "Lagged Log Median Value")

Local Moran’s \(I\)

We will now calculate the local Moran’s \(I\) to examine the spatial pattern of autocorrelation. This returns a statistic (and \(z\)-score) for each area considered.

lm1 = localmoran(boston$logCMEDV, 
                 listw = boston.listw, 
                 alternative = "two.sided")

head(lm1)
##             Ii          E.Ii      Var.Ii        Z.Ii Pr(z != E(Ii))
## 1 -0.224510839 -2.873020e-04 0.017914316 -1.67525561    0.093884090
## 2 -0.001798936 -2.175813e-05 0.002735961 -0.03397629    0.972896060
## 3  0.137484323 -9.177414e-05 0.005723568  1.81848429    0.068990146
## 4  0.033966647 -8.277718e-05 0.010408122  0.33375177    0.738566878
## 5  0.491120158 -4.043873e-04 0.022365621  3.28665963    0.001013833
## 6  0.002533256 -2.287544e-05 0.001909980  0.05848826    0.953359715

The results may be extracted and plotted. The \(z\)-scores are in the fourth column of the output, and the \(p\)-values in the fifth column:

boston$zscore <- lm1[,4]
boston$pval <- lm1[,5]

Now let’s plot these results. First the \(z\)-scores:

tm_shape(boston) + 
  tm_fill("zscore", palette = "Reds", style = "jenks", n = 6) +
  tm_borders() +
  tm_layout(main.title = "Local Moran's I (z-scores)",
            legend.position = c("left", "bottom"))

For the \(p\)-values, we’ll convert these into a binary vector, with ones if the \(p\)-value is below 0.05, zeros otherwise. Plotting these shows clearly the areas with high autocorrelation, by the harbour and to the west of the city.

boston$pval.bin <- as.factor(ifelse(boston$pval < 0.05, "Significant", "Not-significant"))

tm_shape(boston) + 
  tm_fill("pval.bin") +
  tm_borders() +
  tm_layout(main.title = "Local Moran's I (z-scores)",
            main.title.size = 1,
            legend.position = c("left", "bottom"))

Finally, we will use the Getis-Ord \(G^*\) statistic to look at local variation in values, but identifying regions with clusters of high or low values. The \(G^*\) statistic includes the value of the location of interest as well as the neighbors. To account for this, we need first to make a new spatial weight matrix which includes a weight for the link of a location to itself (with the include.self() function). The statistic is calculated using the localG() function, which takes as input, the variable of interest and the spatial weight matrix.

boston.listwGs <- nb2listw(include.self(boston.nb), style = "B")
boston$lG <- as.numeric(localG(boston$logCMEDV, boston.listwGs))

The output is the \(G^*\) statistic, converted to a \(z\)-score. Now we plot the results.

tm_shape(boston) + 
  tm_fill("lG", palette = "RdYlBu", style = "jenks", n = 6) +
  tm_borders() +
  tm_layout(main.title = "Local Getis-Ord G(*) hot/cold spots",
            main.title.size = 1,
            legend.position = c("left", "bottom"))

The results show a very clear structure with clusters of low prices in the center of Boston, but interrupted by a cluster of higher prices running along the river to the harbor. Other clusters, but of high values, are found in the suburbs to the west.

boston$lG.sig <- ifelse(boston$lG < -1.96, 
                        -1,
                        ifelse(boston$lG > 1.96, 1, 0))

boston$lG.sig <- factor(boston$lG.sig, labels=c("Cold", "Not-significant", "Hot"))
tm_shape(boston) + 
  tm_fill("lG.sig", palette = "-RdYlBu", style = "jenks", n = 6) +
  tm_borders() +
  tm_layout(main.title = "Local Getis-Ord G(*) hot/cold spots",
            main.title.size = 1,
            legend.position = c("left", "bottom"))

Spatial regression models

Reading and plotting the data

The spatialreg package has functions for building spatial regression models. We will build several spatial regression models using the crime dataset from Columbus. The dataset is available as a shapefile in columbus.zip - download and extract this before starting.

col <- st_read("../datafiles/columbus/columbus.shp", quiet = TRUE)

Our goal here is to model the crime rate using information about household income (INC) and price (HOVAL). As both of these are right skewed, we’ll log-transform them for the analysis:

col$lINC <- log(col$INC)
col$lHOVAL <- log(col$HOVAL)
  • Plot the crime rate date (column CRIME) by adapting the code given above to plot the NY and Boston datasets.

Building the spatial weight matrix

Now build the neighborhood structure and the associated spatial weight matrix. The example below uses the Queen’s case definition of neighbors and binary weights, but these can be replaced by other options fairly easily.

col.geom <- st_geometry(col)
col.coords <- st_centroid(col.geom)
col.nbq <- poly2nb(col)
plot(col.geom, reset = FALSE)
plot(col.nbq, col.coords, add = TRUE)

Now convert this to a spatial weight matrix:

col.listw <- nb2listw(col.nbq)

Checking for autocorrelation

Use the code given above to look for spatial autocorrelation in the CRIME variable, using both Global and Local Moran’s \(I\).

Spatial regression

OLS regression

Next, we make an OLS regression, excluding all spatial information:

col.fit1 <- lm(CRIME ~ lINC + lHOVAL, data = col)
summary(col.fit1)
## 
## Call:
## lm(formula = CRIME ~ lINC + lHOVAL, data = col)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.361  -5.997  -1.521   7.432  28.013 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  138.074     14.542   9.495 2.06e-12 ***
## lINC         -19.272      5.024  -3.836 0.000379 ***
## lHOVAL       -14.924      4.568  -3.267 0.002059 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.6 on 46 degrees of freedom
## Multiple R-squared:  0.5392, Adjusted R-squared:  0.5191 
## F-statistic: 26.91 on 2 and 46 DF,  p-value: 1.827e-08

The model appears to be a fairly good one, with an \(F\)-statistic that is significant. However, given the pattern in the crime data seen on the map above, it is likely that the model may be impacted by autocorrelation, so we’ll now test for autocorrelation in the residuals, using Moran’s \(I\):

moran.mc(residuals(col.fit1), 
         listw = col.listw, 
         nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(col.fit1) 
## weights: col.listw  
## number of simulations + 1: 1000 
## 
## statistic = 0.21308, observed rank = 998, p-value = 0.002
## alternative hypothesis: greater

I’ve used the Monte Carlo version of Moran’s \(I\), but this could be replaced by the other approaches, depending on your assumptions about the spatial pattern.

Lagrange multiplier test

The Lagrange multiplier test is used to assess whether the autocorrelation is in the values of the dependent variable or in its errors, and helps in the choice of which spatial regression model to use. We first run this test, using the non-robust version. The tests are given by the parameter test - the full range can be found in the help page for the function.

lmt <- lm.LMtests(col.fit1, col.listw, test = c("LMerr","LMlag"))

summary(lmt)
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = CRIME ~ lINC + lHOVAL, data = col)
## weights: col.listw
##  
##       statistic parameter  p.value   
## LMerr    4.7913         1 0.028603 * 
## LMlag    9.1694         1 0.002461 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Higher values of the statistic indicate a more likely source of correlation. Where both are significant, the robust test (RLMerr and RLMlag) should be used to decide. These robust tests account for autocorrelation in one term, then test for remaining autocorrelation in the other term.

lmt_robust <- lm.LMtests(col.fit1, col.listw, test = c("RLMerr","RLMlag"))

summary(lmt_robust)
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = CRIME ~ lINC + lHOVAL, data = col)
## weights: col.listw
##  
##        statistic parameter p.value  
## RLMerr  0.024805         1 0.87485  
## RLMlag  4.402809         1 0.03588 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Evidence for significant autocorrelation is shown in the RLMlag test, suggesting a spatial lag model.

Spatial lag model

A spatial lag model can be fit to the data using the lagsarlm() function. The syntax follows that of most modeling functions in R, except that we need to give it a spatial weight matrix:

col.fit2 <- lagsarlm(CRIME ~ lINC + lHOVAL, 
                     data = col, 
                     col.listw)

summary(col.fit2)
## 
## Call:lagsarlm(formula = CRIME ~ lINC + lHOVAL, data = col, listw = col.listw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -36.1856  -5.4709  -0.5156   6.4413  22.5983 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  99.5700    16.1538  6.1639 7.099e-10
## lINC        -11.9221     4.5544 -2.6177 0.0088524
## lHOVAL      -13.5984     3.9965 -3.4026 0.0006676
## 
## Rho: 0.42092, LR test value: 8.9447, p-value: 0.0027828
## Asymptotic standard error: 0.12671
##     z-value: 3.3219, p-value: 0.00089398
## Wald statistic: 11.035, p-value: 0.00089398
## 
## Log likelihood: -183.6196 for lag model
## ML residual variance (sigma squared): 100.73, (sigma: 10.036)
## Number of observations: 49 
## Number of parameters estimated: 5 
## AIC: 377.24, (AIC for lm: 384.18)
## LM test for residual autocorrelation
## test value: 0.028921, p-value: 0.86496

There are several things to note in the output:

  • Estimates of the coefficients associated with the independent variables
  • The coefficient rho value, showing the strength and significance of the autoregressive spatial component
  • The LM test on residuals to look for remaining autocorrelation
  • The AIC and log-likelihood giving an estimate of the goodness-of-fit of the model

As with the previous model, we can test the residuals for any remaining autocorrelation:

moran.mc(residuals(col.fit2), 
         listw = col.listw, 
         nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(col.fit2) 
## weights: col.listw  
## number of simulations + 1: 1000 
## 
## statistic = 0.010517, observed rank = 641, p-value = 0.359
## alternative hypothesis: greater

Spatial error model

While the results of the Lagrange multiplier test indicated a spatial lag model as the most suitable method, we can also fit a spatial error model, using the errorsarlm() function:

col.fit3 = errorsarlm(CRIME ~ lINC + lHOVAL, 
                      data = col, 
                      col.listw)

summary(col.fit3)
## 
## Call:errorsarlm(formula = CRIME ~ lINC + lHOVAL, data = col, listw = col.listw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -33.3722  -6.3766  -0.1172   6.9844  21.9846 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 117.7761    14.7937  7.9612 1.776e-15
## lINC         -9.9844     4.9172 -2.0305  0.042305
## lHOVAL      -16.0041     4.1761 -3.8323  0.000127
## 
## Lambda: 0.53456, LR test value: 6.9048, p-value: 0.0085966
## Asymptotic standard error: 0.14043
##     z-value: 3.8066, p-value: 0.00014091
## Wald statistic: 14.49, p-value: 0.00014091
## 
## Log likelihood: -184.6396 for error model
## ML residual variance (sigma squared): 101.71, (sigma: 10.085)
## Number of observations: 49 
## Number of parameters estimated: 5 
## AIC: 379.28, (AIC for lm: 384.18)

In the output of the function, note the value of lambda, the autoregressive coefficient representing the strength of autocorrelation in the residuals of a linear model. Note the AIC and compare to the previous model.

Spatial Durbin lag model

So far, we have only considered correlation between values of the dependent variable in any zone and it’s neighbors. An alternative source of spatial dependency might arise between values of a dependent variable and neighboring values of an independent variable. To model this, we can use a spatial Durbin model, that includes lagged independent variables in the neighboring zones. This also uses the lagsarlm() function, but with the parameter type set to ‘mixed’, to specify a Spatial Durbin lag model:

col.fit4 = lagsarlm(CRIME ~ lINC + lHOVAL, 
                    data = col, 
                    col.listw, 
                    type = 'mixed')

summary(col.fit4)
## 
## Call:lagsarlm(formula = CRIME ~ lINC + lHOVAL, data = col, listw = col.listw, 
##     type = "mixed")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -36.42025  -5.66457   0.10208   5.61973  21.82351 
## 
## Type: mixed 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  95.4310    31.8319  2.9980 0.002718
## lINC         -9.6490     4.9167 -1.9625 0.049702
## lHOVAL      -15.2748     4.1030 -3.7229 0.000197
## lag.lINC    -13.6313     8.4749 -1.6084 0.107743
## lag.lHOVAL   11.8547     8.2710  1.4333 0.151776
## 
## Rho: 0.35363, LR test value: 3.4569, p-value: 0.062987
## Asymptotic standard error: 0.16701
##     z-value: 2.1174, p-value: 0.034225
## Wald statistic: 4.4834, p-value: 0.034225
## 
## Log likelihood: -182.0123 for mixed model
## ML residual variance (sigma squared): 95.663, (sigma: 9.7807)
## Number of observations: 49 
## Number of parameters estimated: 7 
## AIC: 378.02, (AIC for lm: 379.48)
## LM test for residual autocorrelation
## test value: 0.14333, p-value: 0.705

The output gives the coefficients for all variables, including the lagged versions. Are any of these significant?

Exercise

  1. The file usa48_usedcars.shp in directory ‘usedcars’ contains information on tax rates and delivery charges for new cars (tax_charge) in the 48 conterminous U.S. states for the period 1955-1959, as well as the average used car price for 1960 (price_1960). Use this dataset to build a spatial model linking used car prices to the tax and delivery costs. You will need the spdep library.
    • Build a neighborhood function linking the 48 states. You will need to choose one of the set of neighborhood functions available in R. Explain your choice of neighborhood function
    • Build a spatial weight matrix. Use the summary() function to get information about the distribution of links per state and report this
    • Use the moran.test() or moran.mc() function to test for spatial autocorrelation in the prices of used cars. Report the Moran’s \(I\) statistic, the z-score and whether or not you can reject the null hypothesis that there is no spatial autocorrelation
    • Build a simple linear model using the lm() function between the used car prices (dependent or Y variable) and the tax and delivery cost (independent or X variable). Report the coefficients of the model and the \(R^2\). Check for autocorrelation in the residuals of the model using the moran.mc() function, and state whether or not this is present
    • Use the Lagrange Multiplier test to identify whether to use a spatial error or spatial lag model. Remember that you may need to use the robust version of this test if non-robust results are both significant. Report the \(p\)-value of each test and give the model choice
    • Now build a spatial model linking the car prices and the tax/delivery cost, using the model you chose in the previous section (either use the lagsarlm() or errorsarlm() function). Report the following information:
      • If using a spatial lag model: a) coefficients (and their significance); b) the value of Rho (the spatial autoregressive coefficient); c) the AIC value and the AIC value of the linear model without the spatial weight matrix
      • If using a spatial error model: a) coefficients (and their significance); b) the value of lambda (the spatial autoregressive coefficient); c) the AIC value and the AIC value of the linear model without the spatial weight matrix
    • Test for remaining autocorrelation in the residuals of the model using the moran.test() or moran.mc() function. Given the value you obtain, state whether you think that the model adequately accounts for autocorrelation?
    • Is the coefficient relating tax and delivery to car price significant? If not, give one reason why this may not be the case

Appendix: Editing A Neighborhood

The spdep package includes a function for interactive editing of neighborhood structures. However, this does not currently work in RStudio due to issues with resizing the plotting window. Instead, you will need to run this from the base R application. As an example, open R (not RStudio!) from your application menu or start menu. Once open, use the ‘Misc’ menu and choose ‘Change working directory’ to browse to the lab files. Once there, load the NY data set, create the Syracuse subset and make a neighborhood structure:

library(sf)
library(spdep)
library(sp)

NY8 <- st_read("../datafiles/NY_data/NY8_utm18.shp")
Syracuse <- NY8[NY8$AREANAME == "Syracuse city", ]
Sy1_nb <- poly2nb(Syracuse)

Note that when you make a plot in base R, it opens a separate window to display this. Now, let’s edit this structure. This fucntion does not yet work with sf objects, so we’ll need to first convert the Syracuse using the sp library. Run the following code:

Syr2 <- as_Spatial(Syracuse)
coords <- coordinates(Syr2)
Sy2_nb <- edit.nb(Sy1_nb, coords, polys = Syr2)

This will display the polygons, centroids and neighborhood structure, and the console will display “Identifying contiguity for deletion”. You can now do two things:

  • If you select two centroids that are already linked, the function will ask if you want to “Delete this line (y/n)”. Enter y to do so, or n to ignore this and move on
  • If you select two centroids that are not linked, the function will ask if you want to “Add contiguity (y/n)”. Enter y to do so, or n to ignore this and move on

Each time you add or delete a link, the function will ask if you want to refresh, continue or quit. Continuing simply waits for you to select two more centroids. If you refresh, it will redraw the neighborhood structure highlighting which links have been deleted (dashed line) or added (yellow line). If you quit, then the interactive session will terminate and the new neighborhood structure will be written out. In the code above, we assign the output of the function to Sy2_nb. If you now plot this and the previous one, you should see the edits you have made. This new neighborhood structure can then be used in all the subsequent functions (Moran’s \(I\), spatial regression, etc.). In order to keep and re-use this structure in RStudio, make sure to save it. As the neighborhood object is somewhat complicated, the easiest way to do this is to write it to an R binary file (this saves both neighborhood structures to a single file):

save(Sy1_nb, Sy2_nb, file="Sy_nb.RData")

You can then load this in a new R or RStudio session, which will recreate these objects

load(file="Sy_nb.RData")